This is the introduction to my final assignment for the open data science I will be working on the alcohol data set

Loading the data set and packages

#clear environment
rm(list=ls())

#read data
fpath<- "C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-final/data/"
alco_data <- read.table(paste0(fpath, "alcohol_student.csv"), sep=",", header=T)
alco_data<- alco_data[,-1]

#necessary packages
#install.packages("dismo")
library(dplyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(tidyr)
library(tidyverse)
library(gbm)
library(dismo)
library(caTools)
library(mgcv)
library(MASS)
library(FactoMineR)

AIMS AND OBJECTUIVES

My aim is to demonstrate the use of various statistical techniques in getting insight in to the dataset, to understand the patterns alcohol consumption consumptions and absences amongst students in two schools. Further, I seek to explore the distribution of grades amongst male and female students; and factors that might be affecting this. Firstly, I will use the basic desctiptive statitstics to understand the distribution, correlation and dependencies(cor, barplot, histogram, biplot etc) I will also be using the regression models(Generalized Linear Model(GLM), Generalised Additive Model(GAM) and Generalised Boosted Model(GBM)). Furthermore, I will test my models using 70:30 data split.

I will also explore Linear Discriminant Analysis(LDA) and Multiple Correspondence Analysis(MCA), to classify and categorise the data.

RESEARCH QUESTIONS

DATA EXPLORATION

categ = apply(alco_data, 2, function(x) nlevels(as.factor(x)))
categ
##     school        sex        age    address    famsize    Pstatus 
##          2          2          7          2          2          2 
##       Medu       Fedu       Mjob       Fjob     reason    nursery 
##          5          5          5          5          4          2 
##   internet   guardian traveltime  studytime   failures  schoolsup 
##          2          3          4          4          4          2 
##     famsup       paid activities     higher   romantic     famrel 
##          2          2          2          2          2          5 
##   freetime      goout       Dalc       Walc     health   absences 
##          5          5          5          5          5         26 
##         G1         G2         G3    alc_use   high_use 
##         14         15         18          9          2
dim(alco_data)
## [1] 382  35
str(alco_data)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
summary(alco_data)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 
#glimpse(alco_data)

# ####Grade:
#bar plot of the variables
gather(alco_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() 

To avoid, confusion, I will copy the data into a new data frame for MCA. Firstly, I will be extracting the categorical variables, as factors

#To avoid, confusion, I will copy the data into a new data frame for MCA
data_mca <- alco_data

#Firstly, I will be extracting the categorical variables, as factors
#now, filter them out for the MCA
data_mca_fac1<-Filter(is.factor, data_mca)

high_use<-factor(alco_data[,"high_use"])
data_mca_fac1 <- cbind.data.frame(data_mca_fac1, high_use)

MCA plot

#par(mfrow=c(1,2))
#now, perform the MCA on the catergorial/qualitative variables
mca_alco1 <- MCA(data_mca_fac1, graph = T)

#plot it
plot(mca_alco1, invisible=c("ind"), habillage="quali")

#par(mfrow=c(1,1))

BETTER VISUALISATION FOR MCA

# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ = apply(data_mca_fac1, 2, function(x) nlevels(as.factor(x)))
categ
##     school        sex    address    famsize    Pstatus       Mjob 
##          2          2          2          2          2          5 
##       Fjob     reason    nursery   internet   guardian  schoolsup 
##          5          4          2          2          3          2 
##     famsup       paid activities     higher   romantic   high_use 
##          2          2          2          2          2          2
# data frame with variable coordinates
mca_vars_df = data.frame(mca_alco1$var$coord, Variable = rep(names(categ), categ))
#mca_vars_df


# data frame with observation coordinates
mca_obs_df = data.frame(mca_alco1$ind$coord)


# MCA plot of observations and categories
ggplot(data = mca_obs_df, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.0) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca_vars_df, 
            aes(x = Dim.1, y = Dim.2, 
                label = rownames(mca_vars_df), colour = Variable)) +
  ggtitle("MCA plot of variables using R package FactoMineR") +
  scale_colour_discrete(name = "Variable")

Here, we can see that female students generally attend school because of her reputation and get more school support. They also have family support and are able to attend paid extra classes. Most of their father’s job are in the health sector. High alcohol consumption is more rampant amongst female students compared to their male counterparts with high alcohol use. male students also attend the school based on course preference and other reasons. By and large, they do not attend paid extra classes compared to the female students and also do not get family and school support like the frmale students. Lastly, they mosly have no intention of pursuing higher education. This is explored further by categorising more continuous variables such as grades, absences, to allow for comparison with other variables.

CATEGORISE MORE VARIABLES

Next, I will be categorising grade(G3), absences, Mother’s education(Medu), father’s education

#Next, I will be categorising grade(G3), absences, Mother's education(Medu),
#father's education
data_mca2<- alco_data

#now, convert mother and father's education level into something more understandable.
#data_mca$Medu<- factor(data_mca$Medu)
data_mca2$Medu<- factor(data_mca2$Medu, labels=c("no", "pr","5-9th", "sec", "hiE"))
data_mca2$Fedu<- factor(data_mca2$Medu, labels=c("no", "pr","5-9th", "sec", "hiE"))

#Now, let's categorise grades according to quartiles
bins_abs<- quantile(data_mca2$absences)
data_mca2$absences<-cut(data_mca2$absences, breaks=bins_abs, include.lowest=T,
                       labels=c("vlow", "low", "high", "vhigh"))

#same to grade(G3)
bins_gra<- quantile(data_mca2$G3)
data_mca2$G3<-cut(data_mca$G3, breaks=bins_gra, include.lowest=T,
                 labels=c("vlow", "low", "high", "vhigh"))

#Getting the columns that are factors
#since MCA is for categorical variables, I will be filtering the categorical
#variables/factors and also categorise some of the variables of interest
#such as absences, grade(G3). I will also Categorise other variables of interest that are in integer forms.

#let's first see the variables already in factor format
names(Filter(is.factor, data_mca2))
##  [1] "school"     "sex"        "address"    "famsize"    "Pstatus"   
##  [6] "Medu"       "Fedu"       "Mjob"       "Fjob"       "reason"    
## [11] "nursery"    "internet"   "guardian"   "schoolsup"  "famsup"    
## [16] "paid"       "activities" "higher"     "romantic"   "absences"  
## [21] "G3"
#now, filter them out for the MCA
data_mca_fac2_<-Filter(is.factor, data_mca2)

#include the high alcohol use column
high_use<-factor(alco_data[,"high_use"])

#join with the dataframe with categorical varianles
data_mca_fac2_<-cbind.data.frame(data_mca_fac2_, high_use)
str(data_mca_fac2_)
## 'data.frame':    382 obs. of  22 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
##  $ Fedu      : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ absences  : Factor w/ 4 levels "vlow","low","high",..: 2 2 4 2 3 2 1 1 4 4 ...
##  $ G3        : Factor w/ 4 levels "vlow","low","high",..: 2 1 2 1 2 2 1 1 4 1 ...
##  $ high_use  : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 2 1 1 1 2 1 ...
#The above can also be done with dplyr which has been loaded already too
#data_mca %>% Filter(f = is.factor)

#names of the variables again ?
names(data_mca_fac2_) 
##  [1] "school"     "sex"        "address"    "famsize"    "Pstatus"   
##  [6] "Medu"       "Fedu"       "Mjob"       "Fjob"       "reason"    
## [11] "nursery"    "internet"   "guardian"   "schoolsup"  "famsup"    
## [16] "paid"       "activities" "higher"     "romantic"   "absences"  
## [21] "G3"         "high_use"
#Alternative for finding the column name that are factors
# names(data_)[ sapply(data_reg, is.factor) ]
# #or
# data_reg %>% Filter(f = is.factor) %>% names
# or
# data_reg %>% summarise_each(funs(is.factor)) %>% unlist %>% .[.] %>% names


#I will select few of the variables for clarity sake and include the sex too
data_mca_fac2<-data_mca_fac2_[,c("sex",names(data_mca_fac2_[,(10:ncol(data_mca_fac2_))]))]
    
#now, perform the MCA on the catergorial/qualitative variables
mca_alco2 <- MCA(data_mca_fac2, graph = T)

#plot it
plot(mca_alco2, invisible=c("ind"), habillage="quali")

# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ2 = apply(data_mca_fac2, 2, function(x) nlevels(as.factor(x)))
categ2
##        sex     reason    nursery   internet   guardian  schoolsup 
##          2          4          2          2          3          2 
##     famsup       paid activities     higher   romantic   absences 
##          2          2          2          2          2          4 
##         G3   high_use 
##          4          2
# data frame with variable coordinates
mca_vars_df2 = data.frame(mca_alco2$var$coord, Variable = rep(names(categ2), categ2))
#mca_vars_df2


# data frame with observation coordinates
mca_obs_df2 = data.frame(mca_alco2$ind$coord)


# MCA plot of observations and categories
ggplot(data = mca_obs_df2, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.1) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca_vars_df2, 
            aes(x = Dim.1, y = Dim.2, 
                label = rownames(mca_vars_df2), colour = Variable)) +
  ggtitle("MCA plot of variables") +
  scale_colour_discrete(name = "Variable")

Here, we can see the categorisation and association In the NW quadrant, it shows that those that pay for for extra classes in portugese and math, have family support and are mostly female students. They also get extra education support and have ambition for higher education.

In the NE quadrant, high alcohol use seem to be associated with guardian other than mother or father. Those with high alcohol use also seem to not have ambition for higher education. They also perform poorly in studies(low grade). and mostly choose school because it is closer to home amongst other reasons, other than reputation and course preference. They also do not seem to participate in extracurricular activities and are mostly in romantic relationship. They also high absences.

In the SW quadrant, it can be seen that those that have low absence perform very well in studies do not take excessive alcohol. They are also engaged in extracurricular activities and have their mother as their guardian. They also motivated to attend the school because of shool’s reputation.

In the SE quadrant, Male students seem to be more moderately absent and do not attend paid extra classes in portuguese and math. They mostly attend the school because of the course preference. They generally also have no internet access. They also mostly do not get school’s support. It also looks like their father are their guardian.

This will be further explored with family of regression models (GLM, GAM, GBM)

REGRESSION MODELS(DLM, GAM, GBM)

Before I proceed, there is need to determine the error distribution of the response variables here. As in many realistic cases, we do not have normal distribution(i.e gaussian), we could also have poisson distribution for count data and last is the binomial distribution (e.g Yes/No, Presence/Absence). That said, one of the assumptions of poisson distribution is that variance is greater than mean. Therefore, I will be creating a function to test this assumption and decide the kind of distribution. You can see more here

Create function to test variance>mean assumption of poisson distribution

#Create function to test variance>mean assumption of poisson distribution
test.var.mn<- function(x){
  if(var(x)>mean(x)){
    print("VALID:The variance>mean assumption of poisson distribution is valid")
  }
  else{
    print("INVALID:The variance>mean assumption of poisson distribution is invalid")
  }
}

TEST THE ASSUMPTION OF POISSON DISTRIBUTION ON AGE AND ABSENCES

#see if Grade(G3) meets the assumption
test.var.mn(alco_data$G3)
## [1] "INVALID:The variance>mean assumption of poisson distribution is invalid"
#next, see if "absences" meets the assumption
test.var.mn(alco_data$absences)
## [1] "VALID:The variance>mean assumption of poisson distribution is valid"

It shows here that it is invalid for grade(G3) but valid for “absences” variable. Next, therefore, I will be exploring various regression models, with grade(G3) as gaussian, Absences as poisson and high alcohol use(High_use) as binomial.

As done earlier, the data will be copied into a new dataframe for all the regression analysis, to avoid confusion and alteration of the original data.

For GLM, predictors will be selected, by firstly using stepwise regression to eliminate the redundant variables. Afterwards, I will be employing the significance test from the model summary, anova test(Chi Square), and AIC values. I also checked if any of the predictor is curvillinear(i.e has higher order polynomial).

Anova Chi square test and the T.test from the model summary were used to further eliminate the insiginificant variables at level 0.05.

Here, we can see that the significant factors affecting students’ grades include, address, mother’s level of education, studytime, ambition to further higher education, and going out. Mother’s education and higher education ambition seem to be the most significatn factors. However, mother’s education level is not curvillear(i.e does not have higher order polynomial(i.e “I(Medu^2)”)

In accordance to the principle of parsimony, I decided to use variables with significance level of 0.05 Therefore, Final model: grade_glm<-glm(G3~ address + Medu + studytime + higher + goout,data=data_reg,family =“gaussian”)

let’s also see the if any observation has a leverage and to further test the normality assumption while I also explore the how the residuals differ from the fitted

par(mfrow=c(2,2))
plot(grade_glm, which = c(1,2,5))

from this, the constant variance of the error seem to be in line. The distribution also appears to be normal from the Normal Q-Q plot although, slight divergence from the line at the lower level. Also, no observation seem to have a clear leverage over others aside a point lying quite farther from others. Nevertheless, gaussian distribution can be safely utilised for modelling the grde(G3) response variable.

Mean Error and RMSE

Create Functions to calculate Mean Error and Root Mean Square Error(RMSE)

# function to calculate the mean absolute and RMSE
#function to calculate mean error
mean_error<- function(obs, pred){
  me<-mean(abs(obs-pred))
  return(me)
}

# Function that returns Root Mean Squared Error
rmse <- function(obs, pred){
  rmse<-sqrt(mean((obs-pred)^2))
  return(rmse)
}

BOOSTRAPPING AND REGRESSION MODELLING(GLM, GAM & GBM).

EVALUATION AND TESTING OF MODELS

MODELLING GRADE(G3).

This part is for the model validation. The data was divided into 70% training/calibration data and 30% testing/evaluation data. It was also boostrapped 10 times. The results of the models and the response curves were thereafter presented. dividing into 70:30

VALIDATION OF REGRESSION MODELS FOR GRADE(G3)

let’s see the correlation between the predicted and observed response variable

#correlation between the predicted and observed grade(G3)
all_cor_grade
##   grade_glm grade_gam grade_gbm1 grade_gbm2
## 1 0.2197972 0.2225753  0.1371433  0.1438754

They all are low and not so much different.

**Below, we can see the errors of the various models.

#error of all the models
all_error_grade
##                grade_glm grade_gam grade_gbm1 grade_gbm2
## mean abs error  2.772634  2.781349   2.745613   2.757907
## RMSE            3.456914  3.454702   3.471570   3.467137

There appears to be not much difference in the errors of all the models used.

GAM and GBM for GRADE(G3) MODELLING

I chose GBM because it is able to handle multicollinearity and complex interactions, it can also show the response curves and relative importance of predictors. GAM is also able to show response curves and their confidence intervals.

SUMMARY OF THE GBM FOR GRADE(G3)

summary(grade_gbm1)

##                   var    rel.inf
## higher         higher 25.7386380
## Medu             Medu 12.6581772
## alc_use       alc_use  8.6389416
## absences     absences  8.0214298
## age               age  6.5330093
## studytime   studytime  6.3548883
## traveltime traveltime  6.2590093
## goout           goout  5.8327510
## Fedu             Fedu  3.9693013
## address       address  3.6344579
## freetime     freetime  2.9256404
## famsup         famsup  1.9842556
## sex               sex  1.8522762
## internet     internet  1.7561133
## romantic     romantic  1.4515468
## activities activities  1.0308137
## famrel         famrel  0.9563164
## Pstatus       Pstatus  0.4024339

here, we can see that higher education pursuit, mother’s educaton level, alc use, and absences seem to be most important factors affecting students performances. This is quite in line with the results I got from GLM and GAM, however, only mother’s education level and higher_use seem to be the significant factors. The least important factors are also shown in the GBM’s summary of relative importance.

RESONSE CURVES(GAM GBM,)

now, let’s see the response curves of this from GAM

grade_gam <- gam(G3~ s(Medu, k=3) + higher + address +                         s(studytime, k=3) + higher + goout, data =                       data_reg,family = "gaussian")
plot(grade_gam, pages=1)

Here, from the smooth curve from GAM, we can see that mother’s education level has a positive effect on student’s performance. However, the confidence interval especially at the lower level. is quite large, which shows that there is a wide range of uncertainty. Perharps, more observations needed. This will be explored further in the response curves from GBM below.

best.iter1<-gbm.perf(grade_gbm1, plot.it = F, method = "OOB")

par(mfrow=c(2,2))
plot.gbm(grade_gbm1, "Medu", best.iter1)
plot.gbm(grade_gbm1, "higher", best.iter1)
plot.gbm(grade_gbm1, "alc_use", best.iter1)
plot.gbm(grade_gbm1, "absences", best.iter1)

par(mfrow=c(1,1))

As we can see here, the higher the level of education of the mother, their kids seem to perform better. Also, studentsä with intention to pursue higher education seem to perform better. Alcohol use and absences reduces performance and can result in dramatic reduction if it becomes chronic.

**PREDICTED VS OBSERVED GRAGE(G3) FROM GBM.

plot(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3, 
     main="Observed vs Predicted grade")
lines(lowess(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3), col="red", lwd=3)
r_grade <-cor.test(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3)
r2grade <- r_grade$estimate^2
r2grade
##       cor 
## 0.3244692
legend("topleft", paste("r^2=", round(r2grade,3)))

We can see from the scatterplots that the selected variables, account for only 32% factors affecting the students’ grades.

MODELLING ABSENCE(GLM, GAM, GBM)

MODEL EVALUATION AND VALIDATION.

For GLM, the selection of predictors was done as earlier by using stepwise regression, anova(Chis sq) test and significance test from the model summary.

Final model

abse_glm<-glm(absences ~ sex + age + Medu + 
                Pstatus +  studytime +  higher + 
                internet + goout + alc_use + G3
              ,data=data_reg,family ="poisson")


summary(abse_glm)
## 
## Call:
## glm(formula = absences ~ sex + age + Medu + Pstatus + studytime + 
##     higher + internet + goout + alc_use + G3, family = "poisson", 
##     data = data_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1134  -1.8059  -0.6326   0.8032   9.8400  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.170583   0.410599   0.415  0.67781    
## sexM        -0.406371   0.054637  -7.438 1.02e-13 ***
## age          0.095102   0.020967   4.536 5.74e-06 ***
## Medu         0.116579   0.024475   4.763 1.91e-06 ***
## PstatusT    -0.458253   0.069217  -6.621 3.58e-11 ***
## studytime   -0.156525   0.033700  -4.645 3.41e-06 ***
## higheryes   -0.220171   0.108890  -2.022  0.04318 *  
## internetyes  0.349942   0.078836   4.439 9.04e-06 ***
## goout        0.007283   0.023303   0.313  0.75463    
## alc_use      0.220332   0.026020   8.468  < 2e-16 ***
## G3          -0.022122   0.008040  -2.751  0.00593 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1917.2  on 381  degrees of freedom
## Residual deviance: 1622.0  on 371  degrees of freedom
## AIC: 2663.6
## 
## Number of Fisher Scoring iterations: 5
anova(abse_glm, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: absences
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        381     1917.2              
## sex        1    6.559       380     1910.7  0.010434 *  
## age        1   39.964       379     1870.7 2.587e-10 ***
## Medu       1   28.611       378     1842.1 8.848e-08 ***
## Pstatus    1   37.104       377     1805.0 1.120e-09 ***
## studytime  1   50.651       376     1754.3 1.103e-12 ***
## higher     1   10.643       375     1743.7  0.001105 ** 
## internet   1   26.691       374     1717.0 2.387e-07 ***
## goout      1   15.196       373     1701.8 9.692e-05 ***
## alc_use    1   72.308       372     1629.5 < 2.2e-16 ***
## G3         1    7.512       371     1622.0  0.006130 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The factors that might be causing absence of students are above.

TRAINING AND TESTING OF MODELS FOR “ABSENCES”

correlation between the predicted and observed

#correlation between the predicted and observed for all the models used.
all_cor_abse
##   abse_glm  abse_gam abse_gbm1 abse_gbm2
## 1 0.258263 0.3423274  0.189674 0.1973915
#The mean correlation
all_cor_abse
##   abse_glm  abse_gam abse_gbm1 abse_gbm2
## 1 0.258263 0.3423274  0.189674 0.1973915

ERRORS OF THE MODELS IN PREDICTING “ABSENCES”

#let's see the errors in the prediction for "absence" response variable.
all_error_abse
##                abse_glm abse_gam abse_gbm1 abse_gbm2
## mean abs error 3.271682 3.221611  3.572327  3.415349
## RMSE           4.396030 4.355035  4.772620  4.678464

FITTING THE REGRESSION FOR “ABSENCES” WITH GBM

I chose GBM because it is able to handle multicollinearity and complex interactions, it can also show the response curves and relative importance of predictors.

Alcohol use, age, mother’s education, parents’ status, freetime and romantic relationship ostensibly, have the most effect on the cause of absences. Travel time and address seem to be the least

RESPONSE CURVES OF “ABSENCES”

GAM

#now, let's see the response curves of this
#use all the dataset
abse_gam <- gam(absences~ sex + s(age, k=3) +s(Medu, k=3) + 
                  Pstatus+  s(studytime, k=3) +  higher + 
                  internet + goout + s(alc_use, k=3) +  s(G3,                  k=3), data = alco_data, family = "poisson")
plot(abse_gam, pages=1)

Teenage students are more likely to be absent although, there seems not to be enough data for older students above 20, as the confidence interval’ seem high. Also. the likelihood of absences reduces with increase in the level of education of the mother. Studytime also reduces this. While Alcohol increases this. It is quite unsure how grade affects.

Furthermore, I will be expploring this further with GBM reponse curves.

RESPONSE CURVES(GBM) FOR “ABSENCES”

best.iter1<-gbm.perf(abse_gbm1, plot.it = F, method = "OOB")


par(mfrow=c(2,3))
plot.gbm(abse_gbm1, "alc_use", best.iter1)
plot.gbm(abse_gbm1, "age", best.iter1)
plot.gbm(abse_gbm1, "Medu", best.iter1)
plot.gbm(abse_gbm1, "Pstatus", best.iter1)
plot.gbm(abse_gbm1, "freetime", best.iter1)
plot.gbm(abse_gbm1, "romantic", best.iter1)

par(mfrow=c(1,1))

Alcohol use, age mother’s education seem to increase the tendency to be absent while freetime does the opposite. This is quite surprising as, I had expected the exact opposite. Also, students’ with parents apart have more tendency to be absent that those with their parents together. Likewise, students in romantic relationship are more likely to be absent than those without.

PREDICTED VS OBSERVED ABSENCES(COEFFICIENT OF DETERMINATION.)

plot(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences, 
     main="Observed vs Predicted absences")
lines(lowess(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences), col="red", lwd=3)
r_abse <-cor.test(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences)
r2abse <- r_abse$estimate^2
r2abse
##       cor 
## 0.2407582
legend("topleft", paste("r^2=", round(r2abse,3)))

We can see from the scatterplots that the selected variables, account for only 23%(the coefficient of determination) factors affecting the students’ absences.

MODELLING HIGH ALCOHOL CONSUMPTION.

Similar step was repeated here. However, in evaluating the models, I utilised Area Under Curve(AUC), odds ratio and confusion/classification matrix, instead.

Final model(GLM) for high alcohol use

halc_glm<-glm(high_use ~sex + studytime + goout + absences
              ,data=data_reg,family ="binomial")

anova(halc_glm, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: high_use
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        381     465.68              
## sex        1   14.732       380     450.95 0.0001239 ***
## studytime  1   10.242       379     440.71 0.0013731 ** 
## goout      1   46.759       378     393.95 8.027e-12 ***
## absences   1   12.641       377     381.31 0.0003775 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# AUC values through the various iterations.
compared_model_halc
##    halc_auc_glm halc_auc_gam halc_auc_gbm1 halc_auc_gbm2
## 1     0.8112348    0.8112348     0.8019568     0.7914980
## 2     0.7619826    0.7619826     0.7505447     0.7302106
## 3     0.7307185    0.7307185     0.7518128     0.7471984
## 4     0.7985957    0.8015521     0.8311160     0.8226164
## 5     0.6938998    0.6938998     0.7116921     0.7069717
## 6     0.7460152    0.7460152     0.7453222     0.7706168
## 7     0.6776786    0.6776786     0.6967857     0.7039286
## 8     0.7596078    0.7580392     0.7788235     0.7698039
## 9     0.6756912    0.6756912     0.6728111     0.6562980
## 10    0.7623529    0.7631373     0.7694118     0.7384314
#The mean AUC values for the various models
mean_auc_halc<-colMeans(compared_model_halc)
#  attach(compared_model_halc)
#This was used temporarily to see if other models are significantly better.

# wilcox.test(halc_auc_gbm1, halc_auc_gam, paird=T)

#let's see the mean auc values for all the models
mean_auc_halc
##  halc_auc_glm  halc_auc_gam halc_auc_gbm1 halc_auc_gbm2 
##     0.7417777     0.7419949     0.7510277     0.7437574

ROC(AUC) curves of the various models.

#show the AUC curves from the last iteration on the test data
par(mfrow=c(2,2))
colAUC(halc_glm_pred, eva$high_use , plotROC = T)
##                     [,1]
## FALSE vs. TRUE 0.7623529
colAUC(pred_halc_gam, eva$high_use , plotROC = T)
##                     [,1]
## FALSE vs. TRUE 0.7631373
colAUC(pred_halc_gbm1, eva$high_use , plotROC = T)
##                     [,1]
## FALSE vs. TRUE 0.7694118
colAUC(pred_halc_gbm2, eva$high_use , plotROC = T)

##                     [,1]
## FALSE vs. TRUE 0.7384314
par(mfrow=c(1,1))

Here, I utilised Area Under Curve for evaluating my model. This is because it prevents subjectiveness in selecting thresholds which can significantly affect the predictive performance and commission and omission error. Although, measures, such as prevalence have been recommended for dealing with selection of threshold to conver into binary(e.g, True or False). AUC readily takes care of this by considering all the possible thresholds and evaluates the performance, accordingly. a value of >0.9 is an excellent model while AUC values with range 0.7-0.8, 0.5-0.7, 0.5 are fairly good, poor and very poor respectively. Here, my AUC values for all the models thorugh my boosting and resampling are about 0.7 for the three different models(GLm, GAM, GBM) which I applied.

Model for alcohol use

RESPONSE CURVES AND MODEL INTERPRETATIONS.

####################################################
#Model for alcohol use
#RESPONSE CURVES AND MODEL INTERPRETATIONS.
#Here, i decided to use the entire data for the analysis
#GAM
halc_gam<-gam(high_use~ sex+ s(studytime, k=3) + s(goout, k=3) + 
                s(absences, k=3) , data =data_reg, family = "binomial")
summary(halc_gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## high_use ~ sex + s(studytime, k = 3) + s(goout, k = 3) + s(absences, 
##     k = 3)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.4490     0.1983  -7.308 2.72e-13 ***
## sexM          0.7817     0.2656   2.944  0.00324 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df Chi.sq  p-value    
## s(studytime)   1      1  6.095   0.0136 *  
## s(goout)       1      1 36.715 1.37e-09 ***
## s(absences)    1      1 12.118   0.0005 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.227   Deviance explained = 18.1%
## UBRE = 0.024361  Scale est. = 1         n = 382
#plot the response curves from GAM
plot(halc_gam, pages=1)

From the above, we can see the response of high alcohol use to the various predictors and also the confidence interval. There appears to be lesser tendency of high alcohol use when study time increases. This is expected, as the student has lesser time for social activities and parties. on the other hand, expectedly, going out increases the tendency of high alcohol use. In similar vein, absence from school increases the tendency too. but as it can be seen from the plot, the confidence interval seems to reduce with increase in number of absences, which might be an indication of insufficient data from that region.

To get a deeper, insight, I will explore this further with GBM

#GBM
halc_gbm1<-gbm(formula = high_use~ sex + age+address+Medu+Fedu+
                 Pstatus+ traveltime+studytime+famsup+activities+higher+
                 internet+famrel+romantic+freetime+goout+                          absences, data=data_reg,
               distribution = "bernoulli",n.trees = 2800,                       shrinkage = 0.001, interaction.depth = 6,
               bag.fraction = 0.75)
summary(halc_gbm1)

##                   var     rel.inf
## goout           goout 29.84995571
## absences     absences 18.33637762
## sex               sex 13.70737976
## famrel         famrel  7.64858603
## traveltime traveltime  6.72151217
## Medu             Medu  3.99777971
## studytime   studytime  3.69179555
## freetime     freetime  3.24293645
## age               age  2.84523742
## activities activities  2.56219643
## Fedu             Fedu  2.45674789
## address       address  1.96990003
## famsup         famsup  1.25181331
## romantic     romantic  1.08620694
## internet     internet  0.43600256
## Pstatus       Pstatus  0.10777830
## higher         higher  0.08779413

From the relative importance, we can see that goout, absences, sex and family relationship seem to have the highest effect on high alcohol use. The least important factors can also be seen from the model summary.

Now, i’ll show the response curve from GBM to see the effects

#Now, i'll show the response curve from GBM to see the effects

#plot(halc_gbm1)
best.iter1<-gbm.perf(halc_gbm1, plot.it = F, method = "OOB")

# pred_halc_gbm1<-predict.gbm(halc_gbm1,newdata = eva, best.iter1, type = "response")

par(mfrow=c(2,2))
plot.gbm(halc_gbm1, "goout", best.iter1)
plot.gbm(halc_gbm1, "absences", best.iter1)
plot.gbm(halc_gbm1, "sex", best.iter1)
plot.gbm(halc_gbm1, "famrel", best.iter1)

par(mfrow=c(1,1))
plot.gbm(halc_gbm1, "studytime", best.iter1)

As it can also be seen from the GM reponse curves, absences and going out, increases the tendency of high alcohol use. Going out steeply affects when it is more than average. Absences of slighty higher than 10 times can even increase the tendency. Family relationship however, reduces the tendency. and overall, male students seem to have relatively more high alcohol use than their female counterpart. Also, as shown earlier, more study time appears to reduce the tendency of high alcohol use.

ODDS RATIO

#explore the odd's ratio
halc_glm_mod <- glm(high_use~  sex + studytime + goout + absences, data=data_reg, family = "binomial") 
odds_ra <- exp(coef(halc_glm_mod))
#odds_ra <- coef(high_use_mod2) %>% exp     #alternaive

# compute confidence intervals (conf_int)
conf_int <- exp(confint(halc_glm_mod)) 
#Conf_Int <- high_use_mod2 %>%  confint() %>% exp   #alternative

# print out the odds ratios with their confidence intervals
cbind(odds_ra, conf_int)
##                odds_ra      2.5 %    97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## sexM        2.18525582 1.30370562 3.7010763
## studytime   0.65628388 0.46510383 0.9099505
## goout       2.06838526 1.64470314 2.6349716
## absences    1.08123884 1.03577383 1.1324143

Here, we can also see that the odd’s ratio of Male students, absences and going out are more than 1, which indicates success: that they all increase the tendency of high alcohol consumption by students. This is not surprising. Absence from school and going out would normally be expected to result in high alcohol use. The confidence interval shows that it is highly likely that these factors affect. On the other hand, study time shows failure with high alcohol consumption which is also not surprising, because when studying more, one would expect that students have less time for going out and getting drunk. These results are all in accordance with the results of the models used earlier.

CONFUSION MATRIX

ERROR IN PREDICTD HIGH ALCOHOL USE. Although, AUC is preferred, as it considers all possible threshold and prevent subjectiveness, I will be exploring a confusion matrix too and 0.5 will be made the threshold for TRUE or FALSE for high alcohol use. More information on creating functions to calculate these metrics in a more automated way can be found here

#fit the model
# predict() the probability of high_use
probs<- predict(halc_glm_mod, type = "response")

#Copy the data again
alc<-data_reg

#add the predicted probabilities to 'alc'
alc$prob_high_use <- probs
#alc <- mutate(alc, probability = probabilities)

#use the probabilities to make a prediction of high_use, setting 0.5 as threshold
alc$predict_high_use<- (alc$prob_high_use)>0.5
#alc <- mutate(alc, prediction = prob_high_use>0.5)

#see the first ten and last ten original classes, predicted probabilities, and class predictions
head(alc[,c("failures", "absences", "sex", "high_use", "prob_high_use", "predict_high_use")], n=10)
##    failures absences sex high_use prob_high_use predict_high_use
## 1         0        3   F    FALSE    0.03910480            FALSE
## 2         1        2   F     TRUE    0.27212509            FALSE
## 3         0        8   F    FALSE    0.09326837            FALSE
## 4         0        2   F    FALSE    0.05424023            FALSE
## 5         1        5   F     TRUE    0.03386206            FALSE
## 6         0        2   F    FALSE    0.05424023            FALSE
## 7         1        0   F    FALSE    0.04676258            FALSE
## 8         0        1   F    FALSE    0.14322690            FALSE
## 9         0        9   F     TRUE    0.06105537            FALSE
## 10        0       10   F    FALSE    0.12696107            FALSE
#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$predict_high_use)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     65   49
#Here, I derived the various classification matrix's metrics
#Accuracy: Overall, how often is the classifier correct?
#(TP+TN)/total
accuracy<- (252 + 49)/nrow(alc)
print(paste("The accuracy of the classification is", accuracy))
## [1] "The accuracy of the classification is 0.787958115183246"
#Error rate
print(paste("The error rate/misclassification is", 1-accuracy))
## [1] "The error rate/misclassification is 0.212041884816754"
#True Positive Rate:
#"Sensitivity" or "Recall": When actually yes, how often does it predict yes?
#(TP/actual yes)
print(paste("The True positive rate/Sensitivity is", 49/(65+49)))
## [1] "The True positive rate/Sensitivity is 0.429824561403509"
#False Positive Rate: i.e When actually no, how often does it predict yes?
print(paste("The false positive rate is", 16/(16+252)))
## [1] "The false positive rate is 0.0597014925373134"
#Specificity: When actually no, how often does it predict no?
#TN/actual no
#Alternatively: 1 minus False Positive Rate
print(paste("The Specificity is", 252/(252+16)))
## [1] "The Specificity is 0.940298507462687"
#Precision: When it predicts yes, how often is it correct?
#TP/predicted yes
print(paste("The precision is", 49/(16+49)))
## [1] "The precision is 0.753846153846154"
#Prevalence: How often does the yes condition actually occur in our sample?
#actual yes/total
print(paste("The prevalence is", (65+49)/nrow(alc)))
## [1] "The prevalence is 0.298429319371728"

Accuracy of the prediction is about 0.79 which is fairly good. Miclassification/error rate is about 0.21 By using threshold of 0.5, we can see that 252 FALSE high alcohol use were Truly/rightly predicted and 16 falsely predicted. There are also 49 high alcohol use that were Truly predicted while 65, wrongly predicted. The prediction seems to be better when predicting no high alcohol use(i.e Specifity) than when predicting the true value(sensitivity. This can also be seen in the plot below.

Plotting the error of prediction.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = prob_high_use, y = high_use, col= predict_high_use))

# define the geom as points and draw the plot
g + geom_point()

ALTERNATIVE FOR CALULATING THE ERROR

#Proportion of errors
conf_mat<-table(high_use = alc$high_use, prediction = alc$predict_high_use)
conf_mat<-prop.table(conf_mat)
addmargins(conf_mat)
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65968586 0.04188482 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.82984293 0.17015707 1.00000000
#mean error of the prediction
mean(abs(alc$high_use-alc$prob_high_use)>0.5)
## [1] 0.2120419
#The below is an alternative way by firstly defining the function to calculate the mean error.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob_high_use)
## [1] 0.2120419

K-fold cross-validation

#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = halc_glm_mod, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2120419

Here I used an arbitrary threshold of 0.5 but it has been suggested that prevalence can be used in selection of threshold. you can see more here about terminologies related to confusion/classification matrix.

LINEAR DISCRIMINANT ANALYSIS(LDA).

###############################################################
#LINEAR DISCRIMINANT ANALYSIS.


#copy data again
data_copy2<- alco_data

#filter the numeric variables.
data_num<-Filter(is.numeric, data_copy2)

#Scale the numeric variables
data_num_sca<- data.frame( scale(data_num))

#get the categorised grades created earlier into the vector
G3_cat<-data_mca_fac2[,c("G3")]

#add to to the scaled data frame
data_num_G3<- cbind(data_num_sca, G3_cat)

#summary
summary(data_num_G3)
##       age               Medu              Fedu           traveltime     
##  Min.   :-1.3519   Min.   :-2.5831   Min.   :-2.3402   Min.   :-0.6434  
##  1st Qu.:-0.4997   1st Qu.:-0.7422   1st Qu.:-0.5158   1st Qu.:-0.6434  
##  Median : 0.3525   Median : 0.1783   Median : 0.3964   Median :-0.6434  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3525   3rd Qu.: 1.0988   3rd Qu.: 1.3086   3rd Qu.: 0.7939  
##  Max.   : 4.6133   Max.   : 1.0988   Max.   : 1.3086   Max.   : 3.6683  
##    studytime           failures           famrel            freetime      
##  Min.   :-1.23721   Min.   :-0.3458   Min.   :-3.24319   Min.   :-2.2541  
##  1st Qu.:-1.23721   1st Qu.:-0.3458   1st Qu.: 0.06937   1st Qu.:-0.2233  
##  Median :-0.04374   Median :-0.3458   Median : 0.06937   Median :-0.2233  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.:-0.04374   3rd Qu.:-0.3458   3rd Qu.: 1.17356   3rd Qu.: 0.7921  
##  Max.   : 2.34320   Max.   : 4.8004   Max.   : 1.17356   Max.   : 1.8075  
##      goout              Dalc              Walc             health       
##  Min.   :-1.8897   Min.   :-0.5434   Min.   :-1.0162   Min.   :-1.8397  
##  1st Qu.:-0.9952   1st Qu.:-0.5434   1st Qu.:-1.0162   1st Qu.:-0.4099  
##  Median :-0.1007   Median :-0.5434   Median :-0.2320   Median : 0.3051  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7938   3rd Qu.: 0.5847   3rd Qu.: 0.5522   3rd Qu.: 1.0200  
##  Max.   : 1.6883   Max.   : 3.9691   Max.   : 2.1206   Max.   : 1.0200  
##     absences             G1                G2                G3         
##  Min.   :-0.8238   Min.   :-3.5147   Min.   :-2.6409   Min.   :-3.4614  
##  1st Qu.:-0.6407   1st Qu.:-0.5517   1st Qu.:-0.5185   1st Qu.:-0.4405  
##  Median :-0.2746   Median : 0.1891   Median : 0.1889   Median : 0.1637  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2746   3rd Qu.: 0.9298   3rd Qu.: 0.8963   3rd Qu.: 0.7679  
##  Max.   : 7.4139   Max.   : 2.4113   Max.   : 2.3112   Max.   : 1.9763  
##     alc_use          G3_cat   
##  Min.   :-0.9024   vlow :143  
##  1st Qu.:-0.9024   low  :107  
##  Median :-0.3947   high : 64  
##  Mean   : 0.0000   vhigh: 68  
##  3rd Qu.: 0.6207              
##  Max.   : 3.1592

**DIVIDE DATA INTO TEST AND TRAIN DATA FOR VALIDATION.

#now, divide into 80:20 train:test data
samp <- sample(1:nrow(data_num_G3), size = nrow(data_num_G3)*0.8)
train <- data_num_G3[samp,]
test <- data_num_G3[-samp,]


#Dimension.
dim(train)
## [1] 305  18
#remove the grades columns
train$G1<-train$G2<-train$G3<-NULL

#check the column names now
colnames(train)
##  [1] "age"        "Medu"       "Fedu"       "traveltime" "studytime" 
##  [6] "failures"   "famrel"     "freetime"   "goout"      "Dalc"      
## [11] "Walc"       "health"     "absences"   "alc_use"    "G3_cat"

fitting the categorised grade to all other variables

#fit the categorised grade to all other variables
lda.fit <- lda(G3_cat~., data = train)
#lda.fit

#create arrow function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

#convert the categorised grade into numeric for the LDA plot
train$G3_cat <- as.numeric(train$G3_cat)

#plot
plot(lda.fit, dimen = 2, col = train$G3, pch= train$G3)
lda.arrows(lda.fit, myscale = 2)

failures in the past is contributing most to poor grades. Absences also appear to be. Mother’s education level seems to be a factor contributing to students’ success.

LDA CLASSIFICATION ACCURACY

#see the class prediction accuracy
G3_cat<-test$G3_cat
test<-dplyr::select(test, -G3_cat)
lda.pred<-predict(lda.fit, newdata = test)
table(correct = G3_cat, predicted = lda.pred$class)
##        predicted
## correct vlow low high vhigh
##   vlow    18  12    0     2
##   low     10   7    0     1
##   high     4   7    0     1
##   vhigh    2   4    1     8

The table above shows the right classification and misclassification.

Distance between variables.

#Now, see the distance between variables.
dist_sca<-dist(data_num_sca)
summary(dist_sca)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5492  4.4994  5.4263  5.6181  6.5457 13.4290

Next, cluster the data better using k-means clustering and refit the LDA using the clustered data.

First, determine the optimal number of clusters using the elbow method of TWCSS.

Distance between variables.

set.seed(123) #This is used to make the values constant when re-run
# determine the number of clusters
k_max <- 20

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(data_num_sca, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = c('point', 'line'))

############################################################
#some other methods of exploring optimal number of classes
# library(cluster)
# library(factoextra)
# # reVisualize k-means clusters
# km.res <- kmeans(data_num_sca, 3, nstart = 25)
# fviz_cluster(km.res, data = data_num_sca, geom = "point",
#              stand = FALSE, frame.type = "norm")
# 
# #install.packages("factoextra")
# require(cluster)
# fviz_nbclust(data_num_sca, kmeans, method = "silhouette")
#############################################################

**Further, use the NbClust to confirm. There are quite many others too. Two others are in the previous code chunk where the packages “cluster” and “factoextra” can be used. See more here for other cluster methods of checking the optimal number of classes.

#install.packages("cluster")
library(NbClust)
nb <- NbClust(data_num_sca, diss=NULL, distance = "euclidean", 
              min.nc=2, max.nc=5, method = "kmeans", 
              index = "all", alphaBeale = 0.1)

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 13 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
#Based on these, I will make my number of clusters, 3

# k-means clustering
#use the euclidean distance calculated earlier.
km <-kmeans(dist_sca, centers = 3)

#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=data_num_sca)


plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)

Seems past failure and alcohol use also causes poor performance

#copydata again
data_copy2<-alco_data
data_num2<-Filter(is.numeric, data_copy2)
data_LDA1_sca<- data.frame( scale(data_num2))
summary(data_LDA1_sca)
##       age               Medu              Fedu           traveltime     
##  Min.   :-1.3519   Min.   :-2.5831   Min.   :-2.3402   Min.   :-0.6434  
##  1st Qu.:-0.4997   1st Qu.:-0.7422   1st Qu.:-0.5158   1st Qu.:-0.6434  
##  Median : 0.3525   Median : 0.1783   Median : 0.3964   Median :-0.6434  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3525   3rd Qu.: 1.0988   3rd Qu.: 1.3086   3rd Qu.: 0.7939  
##  Max.   : 4.6133   Max.   : 1.0988   Max.   : 1.3086   Max.   : 3.6683  
##    studytime           failures           famrel            freetime      
##  Min.   :-1.23721   Min.   :-0.3458   Min.   :-3.24319   Min.   :-2.2541  
##  1st Qu.:-1.23721   1st Qu.:-0.3458   1st Qu.: 0.06937   1st Qu.:-0.2233  
##  Median :-0.04374   Median :-0.3458   Median : 0.06937   Median :-0.2233  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.:-0.04374   3rd Qu.:-0.3458   3rd Qu.: 1.17356   3rd Qu.: 0.7921  
##  Max.   : 2.34320   Max.   : 4.8004   Max.   : 1.17356   Max.   : 1.8075  
##      goout              Dalc              Walc             health       
##  Min.   :-1.8897   Min.   :-0.5434   Min.   :-1.0162   Min.   :-1.8397  
##  1st Qu.:-0.9952   1st Qu.:-0.5434   1st Qu.:-1.0162   1st Qu.:-0.4099  
##  Median :-0.1007   Median :-0.5434   Median :-0.2320   Median : 0.3051  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7938   3rd Qu.: 0.5847   3rd Qu.: 0.5522   3rd Qu.: 1.0200  
##  Max.   : 1.6883   Max.   : 3.9691   Max.   : 2.1206   Max.   : 1.0200  
##     absences             G1                G2                G3         
##  Min.   :-0.8238   Min.   :-3.5147   Min.   :-2.6409   Min.   :-3.4614  
##  1st Qu.:-0.6407   1st Qu.:-0.5517   1st Qu.:-0.5185   1st Qu.:-0.4405  
##  Median :-0.2746   Median : 0.1891   Median : 0.1889   Median : 0.1637  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2746   3rd Qu.: 0.9298   3rd Qu.: 0.8963   3rd Qu.: 0.7679  
##  Max.   : 7.4139   Max.   : 2.4113   Max.   : 2.3112   Max.   : 1.9763  
##     alc_use       
##  Min.   :-0.9024  
##  1st Qu.:-0.9024  
##  Median :-0.3947  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6207  
##  Max.   : 3.1592
#removwe the alcohol columns
data_LDA1_sca$alc_use<-data_LDA1_sca$Walc<-data_LDA1_sca$Dalc<-NULL

#see colmn names
colnames(data_LDA1_sca)
##  [1] "age"        "Medu"       "Fedu"       "traveltime" "studytime" 
##  [6] "failures"   "famrel"     "freetime"   "goout"      "health"    
## [11] "absences"   "G1"         "G2"         "G3"
#Get the alcohol use into a cector
alc_use<-alco_data[,c("alc_use")]

#combine with the scaled data
data_LDA2<- cbind(data_LDA1_sca, alc_use)


#summary(data_LDA2)

lda.fit2 <- lda(alc_use~., data = data_LDA2)
#lda.fit


#first, determine the optimal number of clusters using twcss.
set.seed(123)
# determine the number of clusters
k_max <- 10
alc_sca<-as.data.frame(scale(data_LDA2))
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(alc_sca, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
dist_alc<- dist(alc_sca)
km <-kmeans(dist_sca, centers = 4)


#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=alc_sca)
lda.fit_clus
## Call:
## lda(km$cluster ~ ., data = alc_sca)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.24345550 0.23036649 0.09162304 0.43455497 
## 
## Group means:
##           age        Medu        Fedu traveltime   studytime   failures
## 1 -0.15150389  0.47524588  0.31793533 -0.2879191  0.34125036 -0.3273272
## 2  0.09100733 -0.26100930 -0.13226160  0.2385657 -0.27429626  0.2974988
## 3  0.64464336 -0.40027789 -0.56793020  0.9170400 -0.58932564  1.5656617
## 4 -0.09928495 -0.04348989  0.01173851 -0.1585163  0.07848304 -0.3044375
##        famrel     freetime       goout      health   absences         G1
## 1  0.06937305 -0.037671026 -0.38924340 -0.20230345 -0.2411265  1.2245145
## 2 -0.33214977  0.007491072  0.22458595  0.11007841  0.2933109 -0.6316473
## 3 -0.11991628  0.327936363  0.56380349  0.08036769  0.8028458 -0.8691426
## 4  0.16249732 -0.052009528 -0.01986174  0.03803886 -0.1896759 -0.1679210
##           G2          G3    alc_use
## 1  1.2196444  1.15119286 -0.4930008
## 2 -0.6511951 -0.66705988  0.4591347
## 3 -0.9430175 -0.98425931  1.4475137
## 4 -0.1392539 -0.08379874 -0.2723962
## 
## Coefficients of linear discriminants:
##                    LD1         LD2          LD3
## age        -0.07064798 -0.10742109 -0.154803905
## Medu       -0.21614373 -0.28946292 -0.292725530
## Fedu       -0.04019076  0.11655127  0.358301628
## traveltime  0.30661476 -0.31082720 -0.012004516
## studytime  -0.03182434 -0.07466985 -0.124071455
## failures    0.38617707 -0.87803992 -0.248719853
## famrel     -0.08948820 -0.04204097 -0.813033662
## freetime   -0.03045788 -0.13204798 -0.033243313
## goout       0.08390191  0.25133500  0.003604871
## health     -0.02027092  0.12273177  0.098613125
## absences    0.37848441 -0.27521538  0.221078909
## G1         -0.36587629 -0.65489022  0.378096333
## G2         -0.25588535 -0.69491291  0.318987655
## G3         -0.63484376  0.34177356 -0.616701007
## alc_use     0.85244671 -0.37363870 -0.003084369
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8474 0.1444 0.0083
#plot
plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)

Here, we can see that going out, free time and absences tend tend to cause high alcohol use. On the other hand, study tume, family relationship tend to reduce it.

3D VISUALISATION OF THE CLUSTERS

model_predictors <- dplyr::select(data_LDA2, -alc_use)

# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

#Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)
#using the plotly package for 3D plotting of the matrix products' columns.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=data_LDA2$alc_use)